-
Flash caches are used to reduce peak backend load for throughput-constrained data center services, reducing the total number of backend servers required. Bulk storage systems are a large-scale example, backed by high-capacity but low-throughput hard disks, and using flash caches to provide a more cost-effective storage layer underlying everything from blobstores to data warehouses. However, flash caches must address the limited write endurance of flash by limiting the long-term average flash write rate to avoid premature wearout. To do so, most flash caches must use admission policies to filter cache insertions and maximize the workload-reduction value of each flash write. The Baleen flash cache uses coordinated ML admission and prefetching to reduce peak backend load. After learning painful lessons with our early ML policy attempts, we exploit a new cache residency model (which we call episodes) to guide model training. We focus on optimizing for an end-to-end system metric (Disk-head Time) that measures backend load more accurately than IO miss rate or byte miss rate. Evaluation using Meta traces from seven storage clusters shows that Baleen reduces Peak Disk-head Time (and hence the number of backend hard disks required) by 12% over state-of-the-art policies for a fixed flash write rate constraint. Baleen-TCO, which chooses an optimal flash write rate, reduces our estimated total cost of ownership (TCO) by 17%. Code and traces are available at https://www.pdl.cmu.edu/CILES/.
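To make the admission idea concrete, below is a minimal sketch of a write-rate-constrained, ML-scored admission filter. All names (Segment, AdmissionModel, predict_benefit) and the threshold-adjustment rule are illustrative assumptions, not Baleen's actual design or API.

```python
# Hypothetical sketch of an ML-guided flash-cache admission filter.
# Names and logic are assumptions for illustration, not Baleen's code.
from dataclasses import dataclass

@dataclass
class Segment:
    key: str
    size_bytes: int
    features: list  # e.g., recent access counts, offset, namespace

class AdmissionModel:
    """Stand-in for a trained model that scores how much backend
    disk-head time a flash insertion would save over its residency
    (its "episode")."""
    def predict_benefit(self, features):
        return sum(features)  # placeholder for a real prediction

class AdmissionPolicy:
    def __init__(self, model, write_rate_budget_bytes_per_s):
        self.model = model
        self.budget = write_rate_budget_bytes_per_s
        self.bytes_written = 0.0
        self.elapsed_s = 1e-9
        self.threshold = 1.0  # initial admission threshold, tuned online

    def on_miss(self, seg: Segment, now_s: float) -> bool:
        """Decide whether to insert the missed segment into flash."""
        self.elapsed_s = max(self.elapsed_s, now_s)
        write_rate = self.bytes_written / self.elapsed_s
        # Tighten or relax the admission threshold so the long-term
        # average flash write rate stays under the budget.
        if write_rate > self.budget:
            self.threshold *= 1.05
        else:
            self.threshold *= 0.99
        benefit = self.model.predict_benefit(seg.features)
        if benefit > self.threshold:
            self.bytes_written += seg.size_bytes
            return True   # admit to flash
        return False      # bypass flash, serve from backend disks

policy = AdmissionPolicy(AdmissionModel(), write_rate_budget_bytes_per_s=50e6)
print(policy.on_miss(Segment("blk:42", 128 * 1024, [3, 1, 0]), now_s=1.0))
```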
-
The Sia scheduler efficiently assigns heterogeneous deep learning (DL) cluster resources to elastic resource-adaptive jobs. Although some recent schedulers address one aspect or another (e.g., heterogeneity or resource-adaptivity), none addresses all, and most scale poorly to large clusters and/or heavy workloads even without the full complexity of the combined scheduling problem. Sia introduces a new scheduling formulation that can scale to the search-space sizes and intentionally match jobs and their configurations to GPU types and counts, while adapting to changes in cluster load and job mix over time. Sia also introduces a low-profiling-overhead approach to bootstrapping (for each new job) the throughput models used to evaluate possible resource assignments, and it is the first cluster scheduler to support elastic scaling of hybrid parallel jobs. Extensive evaluations show that Sia outperforms state-of-the-art schedulers. For example, even on relatively small 44- to 64-GPU clusters with a mix of three GPU types, Sia reduces average job completion time (JCT) by 30–93%, 99th percentile JCT and makespan by 28–95%, and GPU hours used by 12–55% for workloads derived from 3 real-world environments. Additional experiments demonstrate that Sia scales to at least 2000-GPU clusters, provides improved fairness, and is not over-sensitive to scheduler parameter settings.
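The following is a minimal, assumed sketch of the matching problem described above: choosing a (GPU type, GPU count) configuration per job under per-type capacity limits, using bootstrapped throughput estimates. The greedy heuristic, job/GPU names, and throughput numbers are placeholders, not Sia's actual scheduling formulation.

```python
# Illustrative sketch (not Sia's formulation): greedily assign each elastic
# job a (GPU type, count) configuration under per-type capacity limits,
# preferring the best estimated throughput per GPU. All numbers are made up.
from itertools import product

capacity = {"A100": 8, "V100": 16, "K80": 24}      # free GPUs per type

def estimated_throughput(job, gpu_type, count):
    """Bootstrapped throughput model: estimated samples/s for a config."""
    base = {"jobA": {"A100": 400, "V100": 180, "K80": 60},
            "jobB": {"A100": 250, "V100": 140, "K80": 50}}[job][gpu_type]
    return base * count ** 0.9                      # sub-linear scaling

jobs = ["jobA", "jobB"]
candidate_counts = [1, 2, 4, 8]

assignments = {}
for job in jobs:
    options = []
    for gpu_type, count in product(capacity, candidate_counts):
        if count <= capacity[gpu_type]:
            tput = estimated_throughput(job, gpu_type, count)
            options.append((tput / count, gpu_type, count, tput))
    # Pick the configuration with the best throughput per GPU still available.
    options.sort(reverse=True)
    _, gpu_type, count, tput = options[0]
    assignments[job] = (gpu_type, count, tput)
    capacity[gpu_type] -= count

print(assignments)
```

A real scheduler would solve this matching jointly over all jobs (and re-solve as load and job mix change) rather than greedily per job; the sketch only shows the shape of the search space.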
-
DNN training is extremely time-consuming, necessitating efficient multi-accelerator parallelization. Current approaches to parallelizing training primarily use intra-batch parallelization, where a single iteration of training is split over the available workers, but suffer from diminishing returns at higher worker counts. We present PipeDream, a system that adds inter-batch pipelining to intra-batch parallelism to further improve parallel training throughput, helping to better overlap computation with communication and reduce the amount of communication when possible. Unlike traditional pipelining, DNN training is bi-directional, where a forward pass through the computation graph is followed by a backward pass that uses state and intermediate data computed during the forward pass. Naïve pipelining can thus result in mismatches in state versions used in the forward and backward passes, or excessive pipeline flushes and lower hardware efficiency. To address these challenges, PipeDream versions model parameters for numerically correct gradient computations, and schedules forward and backward passes of different minibatches concurrently on different workers with minimal pipeline stalls. PipeDream also automatically partitions DNN layers among workers to balance work and minimize communication. Extensive experimentation with a range of DNN tasks, models, and hardware configurations shows that PipeDream trains models to high accuracy up to 5.3X faster than commonly used intra-batch parallelism techniques.
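The sketch below illustrates the parameter-versioning (weight stashing) idea on a single pipeline stage under simplified, assumed semantics: scalar "weights", toy forward/backward math, and a hand-written 1F1B-style interleaving. It is not PipeDream's implementation.

```python
# Assumed simplification of weight stashing: each stage records the parameter
# version used for a minibatch's forward pass, so the matching backward pass
# uses the same weights even if newer versions were installed in between.

class StageWorker:
    def __init__(self, params):
        self.params = params            # current weights for this pipeline stage
        self.stash = {}                 # minibatch id -> weights used in forward

    def forward(self, mb_id, activations):
        # Stash the exact weight version used so backward stays consistent.
        self.stash[mb_id] = self.params
        return [a * self.params for a in activations]   # toy "layer"

    def backward(self, mb_id, grad_out):
        w = self.stash.pop(mb_id)       # same version as this minibatch's forward
        grad_w = sum(grad_out) * w      # toy gradient w.r.t. the weights
        grad_in = [g * w for g in grad_out]
        return grad_in, grad_w

    def apply_update(self, grad_w, lr=0.01):
        # Installs a new weight version; in-flight minibatches keep using
        # their stashed versions.
        self.params = self.params - lr * grad_w

# 1F1B-style interleaving on one stage: two forwards, then alternating
# backwards and updates, with minibatch 1 still using pre-update weights.
stage = StageWorker(params=1.0)
acts0 = stage.forward(0, [1.0, 2.0])
acts1 = stage.forward(1, [0.5, 1.5])
_, g0 = stage.backward(0, [0.1, 0.1])
stage.apply_update(g0)
_, g1 = stage.backward(1, [0.1, 0.1])   # uses the weights stashed at forward(1)
stage.apply_update(g1)
```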